To Do:
- dont’ force 2 axes on simulated data? (sometimes models fail though. Would need to use a “safe” version of
opls())
Overview
After discussion with Elizabeth we decided to create the following combination of datasets:
- Without signal (no covariance) and without discriminating variables. We would predict poor PCA and no separation by PLS-DA
- With signal (covariance) and without discriminating variables. We would expect good PCA (high % var explained), but no separation in PCA or PLS-DA
- Without signal (covariance) and with discriminating variables. We would expect poor PCA, but significant PLS-DA
- With signal and discriminating variables. We would expect good PCA, but little separation in PCA space if only a few discriminating variables, but significant PLS-DA.
I am consider making a Shiny app that allows you to tweak the parameters that generate the data and plots PCA and PLS-DA side-by-side.
Permutation testing: What about using each of these 4 parameter combinations to create ~100 randomly generated datasets. Then with each, I could do PCA followed by t-tests on the PC axes vs. PLS-DA to demonstrate which is better at detecting real separation in the data. Maybe the cherry-picked figure is enough, but this would only be a few sentences in the manuscript and would be valuable I think.
To-Do: Add MANNOVA or PERMANOVA to this
Setup
Load Packages
# Required Packages
library(MASS)
library(tidyverse)
library(ropls)
library(chemhelper)
#chemhelper contains the sim_multvar() function I wrote as well as custom functions for interfacing with ropls package in a friendlier way.
#Install with devtools::install_github("Aariq/chemhelper")
library(cowplot) #for making and saving prettier plots
library(broom) #for tidy(), which turns model output into dataframes
library(vegan) #for PERMANOVA
library(iheatmapr) #for correlation heatmaps
Create Custom Plotting Functions
# Functions for plotting
library(latex2exp)
pca_plot <- function(pca.opls, group_var){
plotdata <- chemhelper::get_plotdata(pca.opls)
ggplot(plotdata$plot_data, aes(x = p1, y = p2, color = group_var)) +
geom_point() +
stat_ellipse() +
labs(x = paste0("PC1 (", plotdata$var_explained$R2X[1] * 100, "%)"),
y = paste0("PC2 (", plotdata$var_explained$R2X[2] * 100, "%)")) +
scale_colour_discrete("Group Membership") +
theme_bw() +
ggtitle("PCA") +
labs(caption = TeX(
paste0(nrow(plotdata$var_explained), " principal components;",
"$R^2(cumulative) = ", max(plotdata$var_explained$`R2X(cum)`, "$"))))
}
plsda_plot <- function(plsda.opls){
plotdata <- chemhelper::get_plotdata(plsda.opls)
ggplot(plotdata$plot_data, aes(x = p1, y = p2, color = y1)) +
geom_point() +
stat_ellipse() +
labs(x = paste0("P1 (", plotdata$axis_stats$R2X[1] * 100, "%)"),
y = paste0("P2 (", plotdata$axis_stats$R2X[2] * 100, "%)")) +
scale_color_discrete("Group Membership") +
theme_bw() +
ggtitle("PLS-DA") +
labs(caption = TeX(
paste0("$R^{2}_{Y} = ", plotdata$model_stats$`R2Y(cum)`, "$; ",
"$Q^{2} = ", plotdata$model_stats$`Q2(cum)`, "$; ",
"$p_{Q^{2}} = ", plotdata$model_stats$pQ2, "$")))
}
oplsda_plot <- function(oplsda.opls){
plotdata <- chemhelper::get_plotdata(oplsda.opls)
ggplot(plotdata$plot_data, aes(x = p1, y = o1, color = y1)) +
geom_point() +
stat_ellipse() +
labs(x = paste0("Pred (", plotdata$axis_stats$R2X[1] * 100, "%)"),
y = paste0("Ortho (", plotdata$axis_stats$R2X[2] * 100, "%)")) +
scale_color_discrete("Group Membership") +
theme_bw() +
ggtitle("OPLS-DA") +
labs(caption = TeX(
paste0("$R^{2}_{Y}=", plotdata$model_stats$`R2Y(cum)`, "$; ",
"$Q^{2}=", plotdata$model_stats$`Q2(cum)`, "$; ",
"$p_{Q^{2}}=", plotdata$model_stats$pQ2, "$")))
}
Create Custom Summary Table Function
my_table <- function(data, pca, plsda){
VIPs <- get_VIP(plsda)
loadings_pls <- getLoadingMN(plsda) %>%
as.data.frame() %>%
rownames_to_column(var = "Variable") %>%
rename(p1_loading = p1, p2_loading = p2)
loadings_pca <- getLoadingMN(pca) %>%
as.data.frame() %>%
rownames_to_column(var = "Variable") %>%
select(Variable, PC1_loading = p1, PC2_loading = p2)
t_tests <- data %>%
select(-group) %>%
map_df(~t.test(.~data$group)$p.value) %>%
gather(key = Variable, value = t_test_p.value)
join1 <- full_join(VIPs, loadings_pls)
join2 <- full_join(join1, loadings_pca)
join3 <- full_join(join2, t_tests)
return(join3)
}
Create safe version of opls()
Occasionally I was getting ‘matrix is singular’ type errors from opls(), which I think is just a chance occurrance with the random data. This “safe” version will always return NULL when there is any kind of error.
safe.opls <- possibly(opls, otherwise = NULL)
Set global options (needle in haystack scenario)
I’ll set a few global options here to make it easier to play around with the simulation.
These options basically create a situation where when there are real differences between groups, it’s only due to a small percentage of variables. Other variables can have mild covariation or none at all.
N = 30 #total number of observations
P = 40 #total number of variables
signal.prop = 0.625 #when there are correlated variables, what proportion?
cov_corvar = 0.8 #covariance for signal variables
disc.prop = 0.125 #when there are discriminating variables, what proportion?
diff_discr = 1.2 #mean difference between groups for discriminating variables
nperm = 50 #how many perumutations
seed = 100 #seed
#calculated values
(p_corvar = round(P*signal.prop))
[1] 25
(p_discr = round(P*disc.prop))
[1] 5
1. No Covariance, No Discrimination
Generate data
sim.data1 <- sim_multvar(p_uncorvar = P,
p_corvar = 0,
p_discr = 0,
cov_corvar = 0,
diff_discr = 0,
N = N,
seed = seed)
sim.data1 %>% select(-group) %>% as.matrix() %>% cor() %>% iheatmap(row_labels = TRUE, col_labels = TRUE)
PCA & PLS-DA
Run PCA
sim.pca1 <- opls(select(sim.data1, -group), plotL = FALSE)
PCA
30 samples x 40 variables
standard scaling of predictors
R2X(cum) pre ort
Total 0.555 7 0
Run PLS-DA
sim.plsda1 <- try(opls(select(sim.data1, -group), sim.data1$group,
plotL = FALSE))
Error : No model was built because the first predictive component was already not significant;
Select a number of predictive components of 1 if you want the algorithm to compute a model despite this.
This fails, but I’ll force two axes for the sake of plotting.
## Must force axes
sim.plsda1 <- opls(select(sim.data1, -group), sim.data1$group,
predI = 2,
permI = 200,
plotL = FALSE)
PLS-DA
30 samples x 40 variables and 1 response
standard scaling of predictors and response(s)
R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
Total 0.148 0.751 -0.765 0.263 2 0 0.64 0.36
Plots:
p1 <- plot_grid(pca_plot(sim.pca1, sim.data1$group) +
theme(legend.position = "none") +
ggtitle("PCA", subtitle = "No covariance, no discriminating variables"),
plsda_plot(sim.plsda1) +
theme(legend.position = "none") +
ggtitle("PLS-DA", subtitle = "No covariance, no discriminating variables"))
p1

Get PC axis loadings and VIP scores.
Note: It’s probably not very responsible to look at VIP scores from a PLS-DA model that is not significant
my_table(sim.data1, sim.pca1, sim.plsda1) %>% arrange(desc(VIP))
Joining, by = "Variable"
Joining, by = "Variable"
Joining, by = "Variable"
Permutation testing
First, make a bunch of datasets with the same parameters
sim.data1.list <- map(1:nperm,
~sim_multvar(p_uncorvar = P,
p_corvar = 0,
p_discr = 0,
cov_corvar = 0,
diff_discr = 0,
N = N,
seed = NA))
names(sim.data1.list) <- 1:nperm
Now, do PCA on all of them and get scores
Now do t-tests on all of them along PCs 1 and 2 and get p-values
#The 'group' variable is the same in all datasets.
grouping <- sim.data1.list[[1]]$group
p1.testresults <- scores.data1.list %>%
#maps t.test() function to all dataframes and converts output into dataframe with tidy()
map_dfr(~t.test(.$p1 ~ grouping) %>% tidy(), .id = "dataset") %>%
#selects just columns of interest
select(dataset,
PC1.effect.size = "estimate",
PC1.t = "statistic",
PC1.p.value = "p.value")
p2.testresults <- scores.data1.list %>%
map_dfr(~t.test(.$p2 ~ grouping) %>% tidy(), .id = "dataset") %>%
select(dataset,
PC2.effect.size = "estimate",
PC2.t = "statistic",
PC2.p.value = "p.value")
data1.PCAresults <- bind_cols(p1.testresults, p2.testresults)
And do PERMANOVA on all of them
data1.permanova <- map_dbl(sim.data1.list,
~ adonis(select(., -group) ~ grouping, method = "eu")$aov.tab$`Pr(>F)`[1])
Now do PLS-DA on all of them and get p-values (this will take a long time)
data1.PLSresults <- pls.data1.list %>%
map_dfr(~get_plotdata(.)$model_stats, .id = "dataset")
data1.comparison <- full_join(data1.PCAresults, data1.PLSresults) %>% add_column("permanova" = data1.permanova)
p1.perm <- ggplot(data1.comparison) +
geom_density(aes(PC1.p.value), fill = "blue", alpha = 0.33) +
geom_density(aes(pQ2), fill = "red", alpha = 0.33) +
geom_density(aes(permanova), fill = "green", alpha = 0.33) +
labs(x = "p value") +
geom_vline(xintercept = 0.05, linetype = 5) +
ggtitle("1. -Covariance, -Discriminating variables")
p1.perm

2. Yes Covariance, No Discrimination
PCA & PLS-DA
Run PCA
sim.pca2 <- opls(select(sim.data2, -group), plotL = FALSE)
PCA
30 samples x 40 variables
standard scaling of predictors
R2X(cum) pre ort
Total 0.523 3 0
Run PLS-DA
sim.plsda2 <- try(opls(select(sim.data2, -group), sim.data2$group,
plotL = FALSE))
Error : No model was built because the first predictive component was already not significant;
Select a number of predictive components of 1 if you want the algorithm to compute a model despite this.
This fails, but I’ll force two axes for the sake of plotting.
## Must force axes
sim.plsda2 <- opls(select(sim.data2, -group), sim.data2$group,
predI = 2,
permI = 200, plotL = FALSE)
PLS-DA
30 samples x 40 variables and 1 response
standard scaling of predictors and response(s)
R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
Total 0.247 0.317 -3.43 0.436 2 0 1 1
Plots:
p2 <- plot_grid(pca_plot(sim.pca2, sim.data2$group) +
theme(legend.position = "none") +
ggtitle("PCA", subtitle = "Yes covariance, no discriminating variables"),
plsda_plot(sim.plsda2) +
theme(legend.position = "none") +
ggtitle("PLS-DA", subtitle = "Yes covariance, no discriminating variables"))
p2

PC1 slightly better, but overall R^2 the same. PLS-DA still not significant
Get PC axis loadings and VIP scores.
Note: It’s probably not responsible to look at VIP scores from a PLS-DA model that is not significant
my_table(sim.data2, sim.pca2, sim.plsda2) %>% arrange(desc(VIP))
Joining, by = "Variable"
Joining, by = "Variable"
Joining, by = "Variable"
Permutation testing
First, make a bunch of datasets with the same parameters
sim.data2.list <- map(1:nperm,
~sim_multvar(p_uncorvar = P - p_corvar,
p_corvar = p_corvar,
p_discr = 0,
cov_corvar = cov_corvar,
diff_discr = 0,
N = N,
seed = NA))
names(sim.data2.list) <- 1:nperm
Now, do PCA on all of them and get loadings.
Now do t-tests on all of them along PCs 1 and 2 and get p-values
#The 'group' variable is the same in all datasets.
grouping <- sim.data2.list[[1]]$group
p1.testresults <- scores.data2.list %>%
#maps t.test() function to all dataframes and converts output into dataframe with tidy()
map_dfr(~t.test(.$p1~grouping) %>% tidy(), .id = "dataset") %>%
#selects just columns of interest
select(dataset,
PC1.effect.size = "estimate",
PC1.t = "statistic",
PC1.p.value = "p.value")
p2.testresults <- scores.data2.list %>%
map_dfr(~t.test(.$p2~grouping) %>% tidy(), .id = "dataset") %>%
select(dataset,
PC2.effect.size = "estimate",
PC2.t = "statistic",
PC2.p.value = "p.value")
Error in model.frame.default(formula = .$p2 ~ grouping) :
invalid type (NULL) for variable '.$p2'
And do PERMANOVA on all of them
data2.permanova <- map_dbl(sim.data2.list,
~ adonis(select(., -group) ~ grouping, method = "eu")$aov.tab$`Pr(>F)`[1])
Now do PLS-DA on all of them and get p-values (this will take a long time)
data2.PLSresults <- pls.data2.list %>% map_dfr(~get_plotdata(.)$model_stats, .id = "dataset")
data2.PLSresults
data2.comparison <- full_join(data2.PCAresults, data2.PLSresults) %>%
add_column("permanova" = data2.permanova)
p2.perm <- ggplot(data2.comparison) +
geom_density(aes(PC1.p.value), fill = "blue", alpha = 0.33) +
geom_density(aes(pQ2), fill = "red", alpha = 0.33) +
geom_density(aes(permanova), fill = "green", alpha = 0.33) +
labs(x = "p value") +
geom_vline(xintercept = 0.05, linetype = 5) +
ggtitle("2. + Covariance, - Discriminating variables")
p2.perm

3. No Covariance, Yes Discrimination
PCA & PLS-DA
Run PCA
sim.pca3 <- opls(select(sim.data3, -group), plotL = FALSE)
PCA
30 samples x 40 variables
standard scaling of predictors
R2X(cum) pre ort
Total 0.534 6 0
Run PLS-DA
PLS-DA
30 samples x 40 variables and 1 response
standard scaling of predictors and response(s)
R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
Total 0.133 0.614 0.331 0.322 1 0 0.265 0.015
Single component model only, so force two axes for the sake of plotting:
sim.plsda3 <- opls(select(sim.data3, -group), sim.data3$group,
permI = 200,
predI = 2,
plotL = FALSE)
PLS-DA
30 samples x 40 variables and 1 response
standard scaling of predictors and response(s)
R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
Total 0.195 0.794 -0.0334 0.239 2 0 0.23 0.06
Plots:
p3 <- plot_grid(pca_plot(sim.pca3, sim.data3$group) +
theme(legend.position = "none") +
ggtitle("PCA", subtitle = "No covariance, yes discriminating variables"),
plsda_plot(sim.plsda3) +
theme(legend.position = "none") +
ggtitle("PLS-DA", subtitle = "No covariance, yes discriminating variables"))
p3

PLS-DA is still not great. Single component model is signifcant. Forced 2 axes for PLS-DA plot. It might be the case that some co-variance between discriminating variables is required for PLS-DA to pull them out. If they are completely orthogonal, how can it draw the “regression line” through 5-dimensional space?
Get PC axis loadings and VIP scores.
my_table(sim.data3, sim.pca3, sim.plsda3) %>% arrange(desc(VIP))
Joining, by = "Variable"
Joining, by = "Variable"
Joining, by = "Variable"
Permutation testing
First, make a bunch of datasets with the same parameters
sim.data3.list <- map(1:nperm,
~sim_multvar(p_uncorvar = P - p_discr,
p_corvar = 0,
p_discr = p_discr,
cov_corvar = 0,
diff_discr = diff_discr,
N = N,
seed = NA))
Now, do PCA on all of them and get loadings.
Now do t-tests on all of them along PCs 1 and 2 and get p-values
#The 'group' variable is the same in all datasets.
grouping <- sim.data3.list[[1]]$group
p1.testresults <- scores.data3.list %>%
#maps t.test() function to all dataframes and converts output into dataframe with tidy()
map_dfr(~t.test(.$p1~grouping) %>% tidy(), .id = "dataset") %>%
#selects just columns of interest
select(dataset,
PC1.effect.size = "estimate",
PC1.t = "statistic",
PC1.p.value = "p.value")
p2.testresults <- scores.data3.list %>%
map_dfr(~t.test(.$p2~grouping) %>% tidy(), .id = "dataset") %>%
select(dataset,
PC2.effect.size = "estimate",
PC2.t = "statistic",
PC2.p.value = "p.value")
data3.PCAresults <- bind_cols(p1.testresults, p2.testresults)
And do PERMANOVA on all of them
data3.permanova <- map_dbl(sim.data3.list,
~ adonis(select(., -group) ~ grouping, method = "eu")$aov.tab$`Pr(>F)`[1])
Now do PLS-DA on all of them and get p-values (this will take a long time)
data3.PLSresults <- pls.data3.list %>% map_dfr(~get_plotdata(.)$model_stats, .id = "dataset")
data3.PLSresults
data3.comparison <- full_join(data3.PCAresults, data3.PLSresults) %>% add_column("permanova" = data3.permanova)
p3.perm <- ggplot(data3.comparison) +
geom_density(aes(PC1.p.value), fill = "blue", alpha = 0.33) +
geom_density(aes(pQ2), fill = "red", alpha = 0.33) +
geom_density(aes(permanova), fill = "green", alpha = 0.33) +
labs(x = "p value") +
geom_vline(xintercept = 0.05, linetype = 5) +
ggtitle("3. - Covariance, + Discriminating variables")
p3.perm

4. Yes Covariance, Yes Discrimination
PCA & PLS-DA
Run PCA
sim.pca4 <- opls(select(sim.data4, -group), plotL = FALSE)
PCA
30 samples x 40 variables
standard scaling of predictors
R2X(cum) pre ort
Total 0.565 3 0
Run PLS-DA
sim.plsda4 <- opls(select(sim.data4, -group), sim.data4$group,
permI = 200,
plotL = FALSE)
PLS-DA
30 samples x 40 variables and 1 response
standard scaling of predictors and response(s)
R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
Total 0.115 0.518 0.159 0.359 1 0 0.015 0.05
This produces a 1-component model. I’ll force two axes for the sake of plotting:
sim.plsda4 <- opls(select(sim.data4, -group), sim.data4$group,
permI = 200,
predI = 2,
plotL = FALSE)
PLS-DA
30 samples x 40 variables and 1 response
standard scaling of predictors and response(s)
R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
Total 0.354 0.604 -0.143 0.332 2 0 0.085 0.335
Plots:
p4 <- plot_grid(pca_plot(sim.pca4, sim.data4$group) +
theme(legend.position = "none") +
ggtitle("PCA", subtitle = "Yes covariance, Yes discriminating variables"),
plsda_plot(sim.plsda4) +
theme(legend.position = "none") +
ggtitle("PLS-DA", subtitle = "Yes covariance, Yes discriminating variables"))
p4

Get PC axis loadings and VIP scores.
my_table(sim.data4, sim.pca4, sim.plsda4) %>% arrange(desc(VIP))
Joining, by = "Variable"
Joining, by = "Variable"
Joining, by = "Variable"
Permutation testing
First, make a bunch of datasets with the same parameters
sim.data4.list <- map(1:nperm,
~sim_multvar(p_uncorvar = P - p_corvar - p_discr,
p_corvar = p_corvar,
p_discr = p_discr,
cov_corvar = cov_corvar,
diff_discr = diff_discr,
cov_discr = 0.2,
N = N,
seed = NA)) %>%
set_names(1:nperm)
Now, do PCA on all of them and get loadings.
Now do t-tests on all of them along PCs 1 and 2 and get p-values
#The 'group' variable is the same in all datasets.
grouping <- sim.data4.list[[1]]$group
p1.testresults <- scores.data4.list %>%
#maps t.test() function to all dataframes and converts output into dataframe with tidy()
map_dfr(~t.test(.$p1~grouping) %>% tidy(), .id = "dataset") %>%
#selects just columns of interest
select(dataset,
PC1.effect.size = "estimate",
PC1.t = "statistic",
PC1.p.value = "p.value")
# p2.testresults <- scores.data4.list %>%
# map_dfr(~t.test(.$p2~grouping) %>% tidy(), .id = "dataset") %>%
# select(dataset,
# PC2.effect.size = "estimate",
# PC2.t = "statistic",
# PC2.p.value = "p.value")
# data4.PCAresults <- bind_cols(p1.testresults, p2.testresults)
data4.PCAresults <- p1.testresults
And do PERMANOVA on all of them
data4.permanova <- map_dbl(sim.data4.list,
~ adonis(select(., -group) ~ grouping, method = "eu")$aov.tab$`Pr(>F)`[1])
Now do PLS-DA on all of them and get p-values (this will take a long time)
data4.PLSresults <- pls.data4.list %>% map_dfr(~get_plotdata(.)$model_stats, .id = "dataset")
data4.PLSresults
data4.comparison <- full_join(data4.PCAresults, data4.PLSresults) %>% add_column("permanova" = data4.permanova)
p4.perm <- ggplot(data4.comparison) +
geom_density(aes(PC1.p.value), fill = "blue", alpha = 0.33) +
geom_density(aes(pQ2), fill = "red", alpha = 0.33) +
geom_density(aes(permanova), fill = "green", alpha = 0.33) +
labs(x = "p value") +
geom_vline(xintercept = 0.05, linetype = 5) +
ggtitle("4. + Covariance, + Discriminating variables")
p4.perm

Plot example datasets

Plot permutation testing results

Conclusions
None of the methods give more false positives than they should—that is, no differences are found when there are no discriminating variables.
PERMANOVA seems to perform the best in these simulations, followed by PLS-DA. PCA is shit at finding separation when discriminating variables are a small portion of the total variables measured (needle in a haystack scenario).
When covariance is added to the “haystack”, or to both the “haystack” and the “needle”, the performance of PERMANOVA drops, while the performance of PLS-DA remains about the same. I should play around with this in a separate notebook.
Trying alternate method of making discriminating variables
I can’t even figure out how to make PLS-DA find a difference.
sim.data5 <- sim.vcov(p_noise = 0,
p_signal = 15,
p_disc = 15,
cov_signal = 0.8,
cov_disc = 0.6,
diff_disc = 0,
var = 0.8,
N = 20,
seed = NA)
sim.data5 <- sim.data5 %>%
rename(Y = disc_1) %>%
arrange(Y) %>%
mutate(group = c(rep("a", nrow(.)/2), rep("b", nrow(.)/2)))
pca5 <- opls(select(sim.data5, -Y, -group), predI = 2)
PCA
20 samples x 29 variables
standard scaling of predictors
R2X(cum) pre ort
Total 0.882 2 0

pca_plot(pca5, sim.data5$group)

plsda5 <- opls(select(sim.data5, -group, -Y), sim.data5$group, permI = 200, predI = 2)
PLS-DA
20 samples x 29 variables and 1 response
standard scaling of predictors and response(s)
R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
Total 0.882 0.507 0.327 0.381 2 0 0.08 0.025

get_VIP(plsda5) %>% arrange(desc(VIP))
pls5 <- opls(select(sim.data5, -group, -Y), sim.data5$Y, permI = 200, predI = 2)
PLS
20 samples x 29 variables and 1 response
standard scaling of predictors and response(s)
R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
Total 0.88 0.64 0.471 0.498 2 0 0.02 0.005

Huh, that doesn’t work
plotdata <- sim.data5 %>%
gather(-group, -Y, key = variable, value = data) %>%
mutate(vartype = case_when(str_detect(variable, "disc") ~ "discriminating",
str_detect(variable, "noise") ~ "no covariance",
str_detect(variable, "signal") ~ "covarying"))
ggplot(plotdata, aes(x = Y, y = data, color = vartype)) +
geom_point() +
geom_smooth(method ="lm", se = FALSE)

Ah ha! Is it because the discriminating data actually does NOT co-vary? No, I upped the covariance and it doesn’t help.
Permanova
permdata5 <- sim.data5 %>%
# select(-Y) %>%
mutate_if(is.double, scale)
adonis(select(permdata5, -group, -Y)~permdata5$group*permdata5$Y, method = "eu")
Call:
adonis(formula = select(permdata5, -group, -Y) ~ permdata5$group * permdata5$Y, method = "eu")
Permutation: free
Number of permutations: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
permdata5$group 1 93.01 93.015 3.8219 0.16881 0.027 *
permdata5$Y 1 54.00 53.997 2.2187 0.09800 0.104
permdata5$group:permdata5$Y 1 14.59 14.587 0.5994 0.02647 0.546
Residuals 16 389.40 24.338 0.70672
Total 19 551.00 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
---
title: "Simulated Data Analyzed with PCA and PLS-DA"
author: Eric R Scott
output: 
  html_notebook: 
    code_folding: hide
    highlight: haddock
    theme: flatly
    toc: yes
    toc_float: yes
---
# To Do:
- dont' force 2 axes on simulated data? (sometimes models fail though.  Would need to use a "safe" version of `opls()`)


# Overview
After discussion with Elizabeth we decided to create the following combination of datasets:

1. Without signal (no covariance) and without discriminating variables.  We would predict poor PCA and no separation by PLS-DA
2. With signal (covariance) and without discriminating variables.  We would expect good PCA (high % var explained), but no separation in PCA or PLS-DA
3. Without signal (covariance) and **with** discriminating variables. We would expect poor PCA, but significant PLS-DA
4. With signal **and** discriminating variables.  We would expect good PCA, but little separation in PCA space if only a few discriminating variables, but significant PLS-DA.

I am consider making a Shiny app that allows you to tweak the parameters that generate the data and plots PCA and PLS-DA side-by-side.

*Permutation testing:*
What about using each of these 4 parameter combinations to create ~100 randomly generated datasets.  Then with each, I could do PCA followed by t-tests on the PC axes vs. PLS-DA to demonstrate which is better at detecting real separation in the data.  Maybe the cherry-picked figure is enough, but this would only be a few sentences in the manuscript and would be valuable I think.

To-Do: Add MANNOVA or PERMANOVA to this

# Setup
## Load Packages
```{r message=FALSE, warning=FALSE}
# Required Packages
library(MASS)
library(tidyverse)
library(ropls)
library(chemhelper)
#chemhelper contains the sim_multvar() function I wrote as well as custom functions for interfacing with ropls package in a friendlier way.  
#Install with devtools::install_github("Aariq/chemhelper")
library(cowplot) #for making and saving prettier plots
library(broom) #for tidy(), which turns model output into dataframes
library(vegan) #for PERMANOVA
library(iheatmapr) #for correlation heatmaps
```

## Create Custom Plotting Functions
```{r}
# Functions for plotting
library(latex2exp)
pca_plot <- function(pca.opls, group_var){
  plotdata <- chemhelper::get_plotdata(pca.opls)
  ggplot(plotdata$plot_data, aes(x = p1, y = p2, color = group_var)) +
    geom_point() +
    stat_ellipse() +
    labs(x = paste0("PC1 (", plotdata$var_explained$R2X[1] * 100, "%)"),
         y = paste0("PC2 (", plotdata$var_explained$R2X[2] * 100, "%)")) +
    scale_colour_discrete("Group Membership") +
    theme_bw() +
    ggtitle("PCA") +
    labs(caption = TeX(
      paste0(nrow(plotdata$var_explained), " principal components;",
             "$R^2(cumulative) = ", max(plotdata$var_explained$`R2X(cum)`, "$"))))
}

plsda_plot <- function(plsda.opls){
  plotdata <- chemhelper::get_plotdata(plsda.opls)
  ggplot(plotdata$plot_data, aes(x = p1, y = p2, color = y1)) +
    geom_point() +
    stat_ellipse() +
    labs(x = paste0("P1 (", plotdata$axis_stats$R2X[1] * 100, "%)"),
         y = paste0("P2 (", plotdata$axis_stats$R2X[2] * 100, "%)")) +
    scale_color_discrete("Group Membership") +
    theme_bw() +
    ggtitle("PLS-DA") +
    labs(caption = TeX(
      paste0("$R^{2}_{Y} = ", plotdata$model_stats$`R2Y(cum)`, "$; ",
             "$Q^{2} = ", plotdata$model_stats$`Q2(cum)`, "$; ",
             "$p_{Q^{2}} = ", plotdata$model_stats$pQ2, "$")))
}

oplsda_plot <- function(oplsda.opls){
  plotdata <- chemhelper::get_plotdata(oplsda.opls)
  ggplot(plotdata$plot_data, aes(x = p1, y = o1, color = y1)) +
    geom_point() +
    stat_ellipse() +
    labs(x = paste0("Pred (", plotdata$axis_stats$R2X[1] * 100, "%)"),
         y = paste0("Ortho (", plotdata$axis_stats$R2X[2] * 100, "%)")) +
    scale_color_discrete("Group Membership") +
    theme_bw() +
    ggtitle("OPLS-DA") +
    labs(caption = TeX(
      paste0("$R^{2}_{Y}=", plotdata$model_stats$`R2Y(cum)`, "$; ",
             "$Q^{2}=", plotdata$model_stats$`Q2(cum)`, "$; ",
             "$p_{Q^{2}}=", plotdata$model_stats$pQ2, "$")))
}
```

## Create Custom Summary Table Function
```{r}
my_table <- function(data, pca, plsda){
  VIPs <- get_VIP(plsda)
  
  loadings_pls <- getLoadingMN(plsda) %>%
    as.data.frame() %>%
    rownames_to_column(var = "Variable") %>% 
    rename(p1_loading = p1, p2_loading = p2)
  
  loadings_pca <- getLoadingMN(pca) %>%
    as.data.frame() %>% 
    rownames_to_column(var = "Variable") %>% 
    select(Variable, PC1_loading = p1, PC2_loading = p2)
  
  t_tests <- data %>%
    select(-group) %>%
    map_df(~t.test(.~data$group)$p.value) %>%
    gather(key = Variable, value = t_test_p.value)
  
  join1 <- full_join(VIPs, loadings_pls)
  join2 <- full_join(join1, loadings_pca)
  join3 <- full_join(join2, t_tests)
  return(join3)
}
```

# Create safe version of opls()
Occasionally I was getting 'matrix is singular' type errors from `opls()`, which I think is just a chance occurrance with the random data.  This "safe" version will always return `NULL` when there is any kind of error.
```{r}
safe.opls <- possibly(opls, otherwise = NULL)
```


# Set global options (needle in haystack scenario)
I'll set a few global options here to make it easier to play around with the simulation.

These options basically create a situation where when there are real differences between groups, it's only due to a small percentage of variables. Other variables can have mild covariation or none at all.
```{r global options}
N = 30 #total number of observations
P = 40 #total number of variables
signal.prop = 0.625 #when there are correlated variables, what proportion?
cov_corvar = 0.8 #covariance for signal variables
disc.prop = 0.125 #when there are discriminating variables, what proportion?
diff_discr = 1.2 #mean difference between groups for discriminating variables
nperm = 50 #how many perumutations
seed = 100 #seed

#calculated values
(p_corvar = round(P*signal.prop))
(p_discr = round(P*disc.prop))
```


# 1. No Covariance, No Discrimination
## Generate data
```{r}
sim.data1 <- sim_multvar(p_uncorvar = P,
                      p_corvar = 0,
                      p_discr = 0,
                      cov_corvar = 0,
                      diff_discr = 0,
                      N = N,
                      seed = seed)
sim.data1 %>% select(-group) %>% as.matrix() %>% cor() %>% iheatmap(row_labels = TRUE, col_labels = TRUE)
```

## PCA & PLS-DA
### Run PCA
```{r}
sim.pca1 <- opls(select(sim.data1, -group), plotL = FALSE)
```
### Run PLS-DA
```{r}
sim.plsda1 <- try(opls(select(sim.data1, -group), sim.data1$group,
                  plotL = FALSE))
```
This fails, but I'll force two axes for the sake of plotting.
```{r}
## Must force axes
sim.plsda1 <- opls(select(sim.data1, -group), sim.data1$group,
                   predI = 2,
                   permI = 200,
                   plotL = FALSE)
```
## Plots:
```{r message=FALSE, warning=FALSE}
p1 <- plot_grid(pca_plot(sim.pca1, sim.data1$group) +
                 theme(legend.position = "none") +
                 ggtitle("PCA", subtitle = "No covariance, no discriminating variables"),
               plsda_plot(sim.plsda1) +
                 theme(legend.position = "none") +
                 ggtitle("PLS-DA", subtitle = "No covariance, no discriminating variables"))
p1
```

## Get PC axis loadings and VIP scores.
Note: It's probably not very responsible to look at VIP scores from a PLS-DA model that is not significant
```{r}
my_table(sim.data1, sim.pca1, sim.plsda1) %>% arrange(desc(VIP))
```

## Permutation testing
First, make a bunch of datasets with the same parameters
```{r}
sim.data1.list <- map(1:nperm,
    ~sim_multvar(p_uncorvar = P,
              p_corvar = 0,
              p_discr = 0,
              cov_corvar = 0,
              diff_discr = 0,
              N = N,
              seed = NA))
names(sim.data1.list) <- 1:nperm
```

Now, do PCA on all of them and get scores
```{r message=FALSE, warning=FALSE, include=FALSE}
pca.data1.list <- map(sim.data1.list,
                      ~safe.opls(select(., -group), plotL = FALSE)) %>% compact()

scores.data1.list <- map(pca.data1.list, ~get_plotdata(.) %>% .$plot_data)
```

Now do t-tests on all of them along PCs 1 and 2 and get p-values
```{r}
#The 'group' variable is the same in all datasets.
grouping <- sim.data1.list[[1]]$group

p1.testresults <- scores.data1.list %>%
  #maps t.test() function to all dataframes and converts output into dataframe with tidy()
  map_dfr(~t.test(.$p1 ~ grouping) %>% tidy(), .id = "dataset") %>% 
  #selects just columns of interest
  select(dataset,
         PC1.effect.size = "estimate",
         PC1.t = "statistic",
         PC1.p.value = "p.value")

p2.testresults <- scores.data1.list %>%
  map_dfr(~t.test(.$p2 ~ grouping) %>% tidy(), .id = "dataset") %>%
  select(dataset,
         PC2.effect.size = "estimate",
         PC2.t = "statistic",
         PC2.p.value = "p.value")

data1.PCAresults <- bind_cols(p1.testresults, p2.testresults)
```

And do PERMANOVA on all of them
```{r}
data1.permanova <- map_dbl(sim.data1.list,
    ~ adonis(select(., -group) ~ grouping, method = "eu")$aov.tab$`Pr(>F)`[1])
```


Now do PLS-DA on all of them and get p-values (this will take a long time)
```{r include=FALSE}
pls.data1.list <- map(sim.data1.list,
                      ~safe.opls(select(., -group), grouping, predI = 2, permI = 200, plotL = FALSE))
```
```{r message=FALSE, warning=FALSE}
data1.PLSresults <- pls.data1.list %>%
  map_dfr(~get_plotdata(.)$model_stats, .id = "dataset")


data1.comparison <- full_join(data1.PCAresults, data1.PLSresults) %>% add_column("permanova" = data1.permanova)

p1.perm <- ggplot(data1.comparison) +
  geom_density(aes(PC1.p.value), fill = "blue", alpha = 0.33) +
  geom_density(aes(pQ2), fill = "red", alpha = 0.33) +
  geom_density(aes(permanova), fill = "green", alpha = 0.33) +
  labs(x = "p value") +
  geom_vline(xintercept = 0.05, linetype = 5) +
  ggtitle("1. -Covariance, -Discriminating variables")
p1.perm
```


# 2. Yes Covariance, No Discrimination
## Generate Data:
```{r}
sim.data2 <- sim_multvar(p_uncorvar = P - p_corvar,
                      p_corvar = p_corvar,
                      p_discr = 0,
                      cov_corvar = cov_corvar,
                      diff_discr = 0,
                      N = N,
                      seed = seed)

sim.data2 %>% select(-group) %>% as.matrix() %>% cor() %>% iheatmap(row_labels = TRUE, col_labels = TRUE)
```

## PCA & PLS-DA
### Run PCA
```{r}
sim.pca2 <- opls(select(sim.data2, -group), plotL = FALSE)
```
### Run PLS-DA
```{r}
sim.plsda2 <- try(opls(select(sim.data2, -group), sim.data2$group,
                  plotL = FALSE))
```
This fails, but I'll force two axes for the sake of plotting.
```{r}
## Must force axes
sim.plsda2 <- opls(select(sim.data2, -group), sim.data2$group,
                   predI = 2,
                   permI = 200, plotL = FALSE)
```
### Plots:
```{r message=FALSE, warning=FALSE}
p2 <- plot_grid(pca_plot(sim.pca2, sim.data2$group) +
                 theme(legend.position = "none") +
                 ggtitle("PCA", subtitle = "Yes covariance, no discriminating variables"),
               plsda_plot(sim.plsda2) +
                 theme(legend.position = "none") +
                 ggtitle("PLS-DA", subtitle = "Yes covariance, no discriminating variables"))
p2
```
PC1 slightly better, but overall R^2 the same. PLS-DA still not significant

## Get PC axis loadings and VIP scores.
Note: It's probably not responsible to look at VIP scores from a PLS-DA model that is not significant
```{r}
my_table(sim.data2, sim.pca2, sim.plsda2) %>% arrange(desc(VIP))
```

## Permutation testing
First, make a bunch of datasets with the same parameters
```{r}
sim.data2.list <- map(1:nperm,
                      ~sim_multvar(p_uncorvar = P - p_corvar,
                                   p_corvar = p_corvar,
                                   p_discr = 0,
                                   cov_corvar = cov_corvar,
                                   diff_discr = 0,
                                   N = N,
                                   seed = NA))
names(sim.data2.list) <- 1:nperm
```

Now, do PCA on all of them and get loadings.
```{r message=FALSE, warning=FALSE, include=FALSE}
pca.data2.list <- map(sim.data2.list,
                      ~safe.opls(select(., -group), plotL = FALSE)) %>% compact()

scores.data2.list <- map(pca.data2.list, ~get_plotdata(.) %>% .$plot_data)
```

Now do t-tests on all of them along PCs 1 and 2 and get p-values
```{r}
#The 'group' variable is the same in all datasets.
grouping <- sim.data2.list[[1]]$group

p1.testresults <- scores.data2.list %>%
  #maps t.test() function to all dataframes and converts output into dataframe with tidy()
  map_dfr(~t.test(.$p1~grouping) %>% tidy(), .id = "dataset") %>% 
  #selects just columns of interest
  select(dataset,
         PC1.effect.size = "estimate",
         PC1.t = "statistic",
         PC1.p.value = "p.value")

# p2.testresults <- scores.data2.list %>%
#   map_dfr(~t.test(.$p2~grouping) %>% tidy(), .id = "dataset") %>%
#   select(dataset,
#          PC2.effect.size = "estimate",
#          PC2.t = "statistic",
#          PC2.p.value = "p.value")

# data2.PCAresults <- bind_cols(p1.testresults, p2.testresults)
data2.PCAresults <- p1.testresults
```

And do PERMANOVA on all of them
```{r}
data2.permanova <- map_dbl(sim.data2.list,
    ~ adonis(select(., -group) ~ grouping, method = "eu")$aov.tab$`Pr(>F)`[1])
```

Now do PLS-DA on all of them and get p-values (this will take a long time)
```{r include=FALSE}
pls.data2.list <- map(sim.data2.list,
                      ~safe.opls(select(., -group), grouping, predI = 2, permI = 200, plotL = FALSE)) %>% compact()
```
```{r message=FALSE, warning=FALSE}
data2.PLSresults <- pls.data2.list %>% map_dfr(~get_plotdata(.)$model_stats, .id = "dataset")
data2.PLSresults

data2.comparison <- full_join(data2.PCAresults, data2.PLSresults) %>%
  add_column("permanova" = data2.permanova)

p2.perm <- ggplot(data2.comparison) +
  geom_density(aes(PC1.p.value), fill = "blue", alpha = 0.33) +
  geom_density(aes(pQ2), fill = "red", alpha = 0.33) +
  geom_density(aes(permanova), fill = "green", alpha = 0.33) +
  labs(x = "p value") +
  geom_vline(xintercept = 0.05, linetype = 5) +
  ggtitle("2. + Covariance, - Discriminating variables")
p2.perm
```

# 3. No Covariance, Yes Discrimination
## Generate Data:
```{r}
sim.data3 <- sim_multvar(p_uncorvar = P - p_discr,
                         p_corvar = 0,
                         p_discr = p_discr,
                         cov_corvar = cov_corvar,
                         diff_discr = diff_discr,
                         cov_discr = 0.6,
                         N = N,
                         seed = seed)
sim.data3 %>% select(-group) %>% as.matrix() %>% cor() %>% iheatmap(row_labels = TRUE, col_labels = TRUE)
```

## PCA & PLS-DA
### Run PCA
```{r}
sim.pca3 <- opls(select(sim.data3, -group), plotL = FALSE)
```
### Run PLS-DA
```{r}
sim.plsda3 <- opls(select(sim.data3, -group), sim.data3$group,
                   permI = 200,
                   plotL = FALSE)
```
Single component model only, so force two axes for the sake of plotting:
```{r}
sim.plsda3 <- opls(select(sim.data3, -group), sim.data3$group,
                   permI = 200,
                   predI = 2,
                   plotL = FALSE)
```

### Plots:
```{r message=FALSE, warning=FALSE}
p3 <- plot_grid(pca_plot(sim.pca3, sim.data3$group) +
                 theme(legend.position = "none") +
                 ggtitle("PCA", subtitle = "No covariance, yes discriminating variables"),
               plsda_plot(sim.plsda3) +
                 theme(legend.position = "none") +
                 ggtitle("PLS-DA", subtitle = "No covariance, yes discriminating variables"))
p3
```
PLS-DA is still not great.  Single component model is signifcant.  Forced 2 axes for PLS-DA plot.  It might be the case that some co-variance between discriminating variables is required for PLS-DA to pull them out.  If they are completely orthogonal, how can it draw the "regression line" through 5-dimensional space?

## Get PC axis loadings and VIP scores.
```{r}
my_table(sim.data3, sim.pca3, sim.plsda3) %>% arrange(desc(VIP))
```

## Permutation testing
First, make a bunch of datasets with the same parameters
```{r}
sim.data3.list <- map(1:nperm,
                      ~sim_multvar(p_uncorvar = P - p_discr,
                                p_corvar = 0,
                                p_discr = p_discr,
                                cov_corvar = 0,
                                diff_discr = diff_discr,
                                cov_discr = 0.6,
                                N = N,
                                seed = NA))
```

Now, do PCA on all of them and get loadings.
```{r message=FALSE, warning=FALSE, include=FALSE}
pca.data3.list <- map(sim.data3.list,
                      ~safe.opls(select(., -group), plotL = FALSE)) %>%
  compact() %>%
  set_names(1:nperm)

scores.data3.list <- map(pca.data3.list, ~get_plotdata(.) %>% .$plot_data)
```

Now do t-tests on all of them along PCs 1 and 2 and get p-values
```{r}
#The 'group' variable is the same in all datasets.
grouping <- sim.data3.list[[1]]$group

p1.testresults <- scores.data3.list %>%
  #maps t.test() function to all dataframes and converts output into dataframe with tidy()
  map_dfr(~t.test(.$p1~grouping) %>% tidy(), .id = "dataset") %>% 
  #selects just columns of interest
  select(dataset,
         PC1.effect.size = "estimate",
         PC1.t = "statistic",
         PC1.p.value = "p.value")

p2.testresults <- scores.data3.list %>%
  map_dfr(~t.test(.$p2~grouping) %>% tidy(), .id = "dataset") %>%
  select(dataset,
         PC2.effect.size = "estimate",
         PC2.t = "statistic",
         PC2.p.value = "p.value")

data3.PCAresults <- bind_cols(p1.testresults, p2.testresults)
```

And do PERMANOVA on all of them
```{r}
data3.permanova <- map_dbl(sim.data3.list,
    ~ adonis(select(., -group) ~ grouping, method = "eu")$aov.tab$`Pr(>F)`[1])
```

Now do PLS-DA on all of them and get p-values (this will take a long time)
```{r include=FALSE}
pls.data3.list <- map(sim.data3.list,
                      ~safe.opls(select(., -group), grouping, predI = 2, permI = 200, plotL = FALSE)) %>% compact()
```
```{r message=FALSE, warning=FALSE}
data3.PLSresults <- pls.data3.list %>% map_dfr(~get_plotdata(.)$model_stats, .id = "dataset")
data3.PLSresults

data3.comparison <- full_join(data3.PCAresults, data3.PLSresults) %>% add_column("permanova" = data3.permanova)

p3.perm <- ggplot(data3.comparison) +
  geom_density(aes(PC1.p.value), fill = "blue", alpha = 0.33) +
  geom_density(aes(pQ2), fill = "red", alpha = 0.33) +
  geom_density(aes(permanova), fill = "green", alpha = 0.33) +
  labs(x = "p value") +
  geom_vline(xintercept = 0.05, linetype = 5) +
  ggtitle("3. - Covariance, + Discriminating variables")
p3.perm
```

# 4. Yes Covariance, Yes Discrimination
## Generate Data:
```{r}
sim.data4 <- sim_multvar(p_uncorvar = P - p_corvar - p_discr,
                         p_corvar = p_corvar,
                         p_discr = p_discr,
                         cov_corvar = cov_corvar,
                         diff_discr = diff_discr,
                         cov_discr = 0.6,
                         N = N,
                         seed = seed)
sim.data4 %>% select(-group) %>% as.matrix() %>% cor() %>% iheatmap(row_labels = TRUE, col_labels = TRUE)
```

## PCA & PLS-DA
### Run PCA
```{r}
sim.pca4 <- opls(select(sim.data4, -group), plotL = FALSE)
```
### Run PLS-DA
```{r}
sim.plsda4 <- opls(select(sim.data4, -group), sim.data4$group,
                   permI = 200,
                   plotL = FALSE)
```
This produces a 1-component model.  I'll force two axes for the sake of plotting:
```{r}
sim.plsda4 <- opls(select(sim.data4, -group), sim.data4$group,
                   permI = 200,
                   predI = 2,
                   plotL = FALSE)
```

### Plots:
```{r message=FALSE, warning=FALSE}
p4 <- plot_grid(pca_plot(sim.pca4, sim.data4$group) +
                 theme(legend.position = "none") +
                 ggtitle("PCA", subtitle = "Yes covariance, Yes discriminating variables"),
               plsda_plot(sim.plsda4) +
                 theme(legend.position = "none") +
                 ggtitle("PLS-DA", subtitle = "Yes covariance, Yes discriminating variables"))
p4
```


## Get PC axis loadings and VIP scores.
```{r}
my_table(sim.data4, sim.pca4, sim.plsda4) %>% arrange(desc(VIP))
```

## Permutation testing
First, make a bunch of datasets with the same parameters
```{r}
sim.data4.list <- map(1:nperm,
                      ~sim_multvar(p_uncorvar = P - p_corvar - p_discr,
                                   p_corvar = p_corvar,
                                   p_discr = p_discr,
                                   cov_corvar = cov_corvar,
                                   diff_discr = diff_discr,
                                   cov_discr = 0.2,
                                   N = N,
                                   seed = NA)) %>% 
  set_names(1:nperm)
```

Now, do PCA on all of them and get loadings.
```{r message=FALSE, warning=FALSE, include=FALSE}
pca.data4.list <- map(sim.data4.list,
                      ~safe.opls(select(., -group), plotL = FALSE)) %>%
  compact()
scores.data4.list <- map(pca.data4.list, ~get_plotdata(.) %>% .$plot_data)
```

Now do t-tests on all of them along PCs 1 and 2 and get p-values
```{r}
#The 'group' variable is the same in all datasets.
grouping <- sim.data4.list[[1]]$group

p1.testresults <- scores.data4.list %>%
  #maps t.test() function to all dataframes and converts output into dataframe with tidy()
  map_dfr(~t.test(.$p1~grouping) %>% tidy(), .id = "dataset") %>% 
  #selects just columns of interest
  select(dataset,
         PC1.effect.size = "estimate",
         PC1.t = "statistic",
         PC1.p.value = "p.value")

# p2.testresults <- scores.data4.list %>%
#   map_dfr(~t.test(.$p2~grouping) %>% tidy(), .id = "dataset") %>%
#   select(dataset,
#          PC2.effect.size = "estimate",
#          PC2.t = "statistic",
#          PC2.p.value = "p.value")

# data4.PCAresults <- bind_cols(p1.testresults, p2.testresults)
data4.PCAresults <- p1.testresults
```

And do PERMANOVA on all of them
```{r}
data4.permanova <- map_dbl(sim.data4.list,
    ~ adonis(select(., -group) ~ grouping, method = "eu")$aov.tab$`Pr(>F)`[1])
```

Now do PLS-DA on all of them and get p-values (this will take a long time)
```{r include=FALSE}
pls.data4.list <- map(sim.data4.list,
                      ~safe.opls(select(., -group), grouping, predI = 2, permI = 200, plotL = FALSE)) %>% compact()
```
```{r message=FALSE, warning=FALSE}
data4.PLSresults <- pls.data4.list %>% map_dfr(~get_plotdata(.)$model_stats, .id = "dataset")
data4.PLSresults

data4.comparison <- full_join(data4.PCAresults, data4.PLSresults) %>% add_column("permanova" = data4.permanova)

p4.perm <- ggplot(data4.comparison) +
  geom_density(aes(PC1.p.value), fill = "blue", alpha = 0.33) +
  geom_density(aes(pQ2), fill = "red", alpha = 0.33) +
  geom_density(aes(permanova), fill = "green", alpha = 0.33) +
  labs(x = "p value") +
  geom_vline(xintercept = 0.05, linetype = 5) +
  ggtitle("4. + Covariance, + Discriminating variables")
p4.perm
```

# Plot example datasets
```{r}
plots <- plot_grid(p1,p2,p3,p4, ncol = 1, nrow = 4)
save_plot("figs/fig1.png", plots, ncol = 2, nrow = 4, base_aspect_ratio = 1, base_height = 3.5)
```


# Plot permutation testing results
```{r}
plots.perm <- plot_grid(p1.perm, p2.perm, p3.perm, p4.perm, ncol = 2, nrow = 2)
p.val_figure <- plot_grid(ggplot() + ggtitle("Distr. of p values from t-tests on PC1 (blue), PLS-DA (red), PERMANOVA (green)"), plots.perm, ncol = 1, rel_heights = c(0.1, 1))
p.val_figure
save_plot("figs/fig2.png", p.val_figure, ncol = 2, nrow = 2, base_aspect_ratio = 1.5)
```

# Conclusions
None of the methods give more false positives than they should---that is, no differences are found when there are no discriminating variables.

PERMANOVA seems to perform the best in these simulations, followed by PLS-DA.  PCA is shit at finding separation when discriminating variables are a small portion of the total variables measured (needle in a haystack scenario).

When covariance is added to the "haystack", or to both the "haystack" and the "needle", the performance of PERMANOVA drops, while the performance of PLS-DA remains about the same. I should play around with this in a separate notebook.


# Trying alternate method of making discriminating variables

I can't even figure out how to make PLS-DA find a difference.
```{r}
sim.data5 <- sim.vcov(p_noise = 0,
                      p_signal = 15,
                      p_disc = 15,
                      cov_signal = 0.8,
                      cov_disc = 0.6,
                      diff_disc = 0,
                      var = 0.8,
                      N = 20,
                      seed = NA)

sim.data5 <- sim.data5 %>%
  rename(Y = disc_1) %>% 
  arrange(Y) %>% 
  mutate(group = c(rep("a", nrow(.)/2), rep("b", nrow(.)/2)))
```

```{r}
pca5 <- opls(select(sim.data5, -Y, -group), predI = 2)
pca_plot(pca5, sim.data5$group)
```

```{r}
plsda5 <- opls(select(sim.data5, -group, -Y), sim.data5$group, permI = 200, predI = 2)
```
```{r}
get_VIP(plsda5) %>% arrange(desc(VIP))
```
```{r}
pls5 <- opls(select(sim.data5, -group, -Y), sim.data5$Y, permI = 200, predI = 2)
```

Huh, that doesn't work
```{r}
plotdata <- sim.data5 %>% 
  gather(-group, -Y, key = variable, value = data) %>% 
  mutate(vartype = case_when(str_detect(variable, "disc") ~ "discriminating",
                             str_detect(variable, "noise") ~ "no covariance",
                             str_detect(variable, "signal") ~ "covarying"))
ggplot(plotdata, aes(x = Y, y = data, color = vartype)) +
  geom_point() +
  geom_smooth(method ="lm", se = FALSE)
```
*Ah ha! Is it because the discriminating data actually does NOT co-vary?*
No, I upped the covariance and it doesn't help.

Permanova
```{r}
permdata5 <- sim.data5 %>%
  # select(-Y) %>% 
  mutate_if(is.double, scale)
adonis(select(permdata5, -group, -Y)~permdata5$group*permdata5$Y, method = "eu")
```

